library(Stat2Data)
data("RailsTrails")
colnames(RailsTrails)
## [1] "HouseNum" "Acre" "AcreGroup" "Adj1998" "Adj2007"
## [6] "Adj2011" "BedGroup" "Bedrooms" "BikeScore" "Diff2014"
## [11] "Distance" "DistGroup" "GarageSpaces" "GarageGroup" "Latitude"
## [16] "Longitude" "NumFullBaths" "NumHalfBaths" "NumRooms" "PctChange"
## [21] "Price1998" "Price2007" "Price2011" "Price2014" "SFGroup"
## [26] "SquareFeet" "StreetName" "StreetNum" "WalkScore" "Zip"
model 1: A model that predicts the 2014 sale price based on an additive combination of the home's distance from the bike path and its size in square feet.
model1 = lm(Price2014 ~ Distance + SquareFeet, data = RailsTrails)
summary(model1)
##
## Call:
## lm(formula = Price2014 ~ Distance + SquareFeet, data = RailsTrails)
##
## Residuals:
## Min 1Q Median 3Q Max
## -152.15 -30.27 -4.14 25.75 337.93
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.985 25.607 3.085 0.00263 **
## Distance -15.788 7.586 -2.081 0.03994 *
## SquareFeet 147.920 12.765 11.588 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.55 on 101 degrees of freedom
## Multiple R-squared: 0.6574, Adjusted R-squared: 0.6506
## F-statistic: 96.89 on 2 and 101 DF, p-value: < 2.2e-16
model 2: A model that predicts the 2014 sale price based on the home's distance from the bike path, its size in square feet, and the number of full bathrooms the home has, allowing for the effect of distance to depend on a home's size
model2 = lm(Price2014 ~ NumFullBaths + Distance * SquareFeet, data = RailsTrails)
summary(model2)
##
## Call:
## lm(formula = Price2014 ~ NumFullBaths + Distance * SquareFeet,
## data = RailsTrails)
##
## Residuals:
## Min 1Q Median 3Q Max
## -163.563 -32.212 -0.187 21.010 306.570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.58 32.31 0.978 0.3307
## NumFullBaths 27.47 12.24 2.243 0.0271 *
## Distance 22.67 22.63 1.002 0.3189
## SquareFeet 155.32 20.01 7.761 7.77e-12 ***
## Distance:SquareFeet -30.76 16.67 -1.846 0.0679 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.62 on 99 degrees of freedom
## Multiple R-squared: 0.6836, Adjusted R-squared: 0.6708
## F-statistic: 53.48 on 4 and 99 DF, p-value: < 2.2e-16
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: Price2014 ~ Distance + SquareFeet
## Model 2: Price2014 ~ NumFullBaths + Distance * SquareFeet
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 101 433959
## 2 99 400711 2 33248 4.1072 0.01934 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Why is a nested F-test an appropriate way to compare these two models?
all of the predictor variables in model1 (distance from the bike path and size in square feet) are contained in the set of predictor variables in model2 (distance, size, and number of full bathrooms, plus the distance-by-size interaction). because the models are nested, we can compare them to determine whether the additional predictor and interaction make a better model.
Which model is the “full” model, and which is the “reduced” model? How can you tell?
the full model is model2, because it contains all of the predictors in model1 (model1 is nested within it); model1 doesn't contain all of model2's predictors.
What null hypothesis does this nested F-test assess?
the nested F-test's hypotheses are below, where \(W\) is the set of coefficients in the full model but not the reduced one (here, \(\beta_{NumFullBaths}\) and \(\beta_{Distance \times SquareFeet}\)):
\(H_0: \beta_i = 0\) for all \(\beta_i \in W\); \(H_a: \beta_i \neq 0\) for at least one \(\beta_i \in W\)
Based on the results of this nested F-test, which model of home prices should preferred: Model 1 or Model 2? Why?
this test tells us that \(SSE_{reduced} - SSE_{full} = 33248\), and that \(k_{full} - k_{reduced} = 2\). The F-statistic is \(4.1072\), with a significant p-value of \(0.01934\), which tells us that adding the number of full bathrooms and allowing the interaction between distance and square footage significantly improved the model's fit.
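as a check on the table, the F statistic can be recomputed by hand from the two residual sums of squares reported by anova() (a quick sketch using only the printed values):

```r
## nested F-test by hand, using the RSS and df values from the ANOVA table above
rss_reduced <- 433959            # model1, 101 residual df
rss_full    <- 400711            # model2, 99 residual df
f_stat <- ((rss_reduced - rss_full) / 2) / (rss_full / 99)
p_val  <- pf(f_stat, df1 = 2, df2 = 99, lower.tail = FALSE)
round(c(F = f_stat, p = p_val), 4)   # matches F = 4.1072, p = 0.01934 up to rounding
```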
Based on your model comparison above, can you tell whether adding
the \(Distance \cdot SquareFeet\)
interaction is responsible for the increased goodness of fit in the full
model, or whether the increase is due to adding the
NumFullBaths to the model?
no – we cannot determine whether the change comes from including another predictor (NumFullBaths) or from allowing distance and square footage to interact, because both of those things changed at once from the original model. we only know how much model2 improved over model1, and that two changes were made; we can't attribute the improvement to either change with what we have so far.
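one way to disentangle the two changes (a sketch, not part of the original analysis, assuming model1 and model2 are still in the workspace) is to fit the intermediate model that adds only NumFullBaths, then test each single change with its own nested F-test:

```r
## intermediate model: model1 plus NumFullBaths, but no interaction
model_baths <- lm(Price2014 ~ Distance + SquareFeet + NumFullBaths,
                  data = RailsTrails)

anova(model1, model_baths)  # contribution of NumFullBaths alone
anova(model_baths, model2)  # contribution of the Distance:SquareFeet interaction
```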
library(tidyverse)
library(ggplot2)
fishing <- read_csv("https://wjhopper.github.io/SDS-201/data/fishing.csv",
name_repair = "universal") |>
select(Year, Lake, Walleye, Channel.Catfish)
lake_model = lm(Walleye ~ Lake, data = fishing)
summary(lake_model)
##
## Call:
## lm(formula = Walleye ~ Lake, data = fishing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3351.5 -165.9 -37.8 73.3 12053.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3351.5 130.3 25.713 <2e-16 ***
## LakeHuron -1786.3 179.0 -9.982 <2e-16 ***
## LakeMichigan -3175.5 184.3 -17.227 <2e-16 ***
## LakeOntario -3313.7 188.9 -17.540 <2e-16 ***
## LakeSaint Clair -3186.7 199.8 -15.947 <2e-16 ***
## LakeSuperior -3268.1 179.0 -18.261 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1492 on 768 degrees of freedom
## (97 observations deleted due to missingness)
## Multiple R-squared: 0.4043, Adjusted R-squared: 0.4004
## F-statistic: 104.2 on 5 and 768 DF, p-value: < 2.2e-16
# plot(emmeans(lake_model, specs = ~ Lake),
# horizontal = FALSE,
# ylab = "Amount of Walleye caught (1000s of lbs)")
How different is the amount of Walleye caught on average between Lake Ontario and Lake Michigan?
# avg_walleye <- emmeans(lake_model, specs = ~ Lake)
# contrast(avg_walleye, method = "consec")
Fit a regression model that predicts the amount of Walleye caught based on an interaction between the lake being fished and the amount of Channel Catfish caught
walleye_model <- lm(Walleye ~ Channel.Catfish * Lake,
data = filter(fishing, Lake %in% c("Erie", "Huron")))
summary(walleye_model)
##
## Call:
## lm(formula = Walleye ~ Channel.Catfish * Lake, data = filter(fishing,
## Lake %in% c("Erie", "Huron")))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4754.9 -1292.3 -215.3 998.1 9524.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3232.6884 361.7729 8.936 3.07e-16 ***
## Channel.Catfish 1.4802 0.4623 3.202 0.001595 **
## LakeHuron -1176.9270 566.3012 -2.078 0.038994 *
## Channel.Catfish:LakeHuron -2.6783 0.7939 -3.374 0.000895 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2430 on 195 degrees of freedom
## (99 observations deleted due to missingness)
## Multiple R-squared: 0.2791, Adjusted R-squared: 0.268
## F-statistic: 25.17 on 3 and 195 DF, p-value: 8.274e-14
ggplot(data = filter(fishing, Lake %in% c("Erie", "Huron")),
mapping = aes(x = Channel.Catfish,
y = Walleye,
color = Lake)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ x) +
labs(x="Amount of Channel Catfish caught (1000s of lbs)",
y="Amount of Walleye caught (1000s of lbs)")
## Warning: Removed 99 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 99 rows containing missing values or values outside the scale range
## (`geom_point()`).
What is the difference in the average amount of Walleye caught in Lake Erie and the average amount of Walleye caught in Lake Huron when 200 thousand pounds of Channel Catfish are caught?
# walleye_grid <- emmeans(walleye_model, specs = ~ Channel.Catfish * Lake,
# at = list(Channel.Catfish = 200))
# walleye_grid
# contrast(walleye_grid, method = "pairwise", by="Channel.Catfish")
library(Stat2Data)
library(regplanely)
library(ggplot2)
library(performance)
data("Diamonds")
head(Diamonds)
## Carat Color Clarity Depth PricePerCt TotalPrice
## 1 1.08 E VS1 68.6 6693.3 7228.8
## 2 0.31 F VVS1 61.9 3159.0 979.3
## 3 0.31 H VS1 62.1 1755.0 544.1
## 4 0.32 F VVS1 60.8 3159.0 1010.9
## 5 0.33 D IF 60.8 4758.8 1570.4
## 6 0.33 G VVS1 61.5 2895.8 955.6
ggplot(data = Diamonds, mapping = aes(y = TotalPrice,
x = Carat)) +
geom_point() +
# geom_smooth(method = lm, formula = y ~ x) +
theme_bw()
model = lm(TotalPrice ~ Carat * PricePerCt, data = Diamonds)
summary(model)
##
## Call:
## lm(formula = TotalPrice ~ Carat * PricePerCt, data = Diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.112023 -0.025374 0.002803 0.026619 0.080538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.096e-02 8.836e-03 -1.240e+00 0.2158
## Carat 6.831e-03 1.049e-02 6.510e-01 0.5153
## PricePerCt 2.936e-06 1.629e-06 1.802e+00 0.0724 .
## Carat:PricePerCt 1.000e+00 8.663e-07 1.154e+06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03583 on 347 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 5.501e+12 on 3 and 347 DF, p-value: < 2.2e-16
check_model(model, check = c("qq", "normality", "linearity", "homogeneity"))
while there are some mild violations of the assumptions made for the
model, especially as fitted values increase in magnitude, the model is
mostly sound.
regression_plane(model)
this model has an adjusted \(R^2\) value of 1, which means essentially all of the variation is explained by our model – unsurprising, since total price is by definition roughly carats times price per carat. because of the interaction, the effect of Carat is not a single number: for each increase in Carat by one, the predicted price increases by \(6.831 \times 10^{-3} + 1.000 \times PricePerCt\), holding PricePerCt fixed.
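because of the interaction term, the slope on Carat depends on PricePerCt: it equals 6.831e-03 + 1.000 × PricePerCt. a small helper function (hypothetical, hard-coding the fitted estimates from the table above) makes that concrete:

```r
## marginal effect of Carat on TotalPrice at a given price-per-carat,
## using the coefficient estimates from the regression table above
carat_slope <- function(price_per_ct) {
  6.831e-03 + 1.000 * price_per_ct
}

carat_slope(3159)   # for a diamond at $3159 per carat: about 3159.01
```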
library(tidyverse)
library(ggplot2)
library(dplyr)
lung_volume = read.csv("https://bit.ly/lung_volume")
head(lung_volume)
## age fev ht sex smoke
## 1 9 1.708 57.0 F No
## 2 8 1.724 67.5 F No
## 3 7 1.720 54.5 F No
## 4 9 1.558 53.0 M No
## 5 9 1.895 57.0 M No
## 6 8 2.336 61.0 F No
ggplot(data = lung_volume, aes(x = smoke, y = fev)) +
geom_boxplot() +
theme_bw()
Visualize a regression model that estimates how smoking affects FEV, taking into account a person's age. Your model should allow the effect of smoking to depend on a person's age.
ggplot(data = lung_volume, mapping = aes(y = fev,
color = smoke,
x = age)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ x, se = F) +
theme_bw()
How does age affect lung volume among non-smokers? How does age affect lung volume among smokers?
smoking_model = lm(fev ~ age * smoke, data = lung_volume)
summary(smoking_model)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2533955 0.08265075 3.065859 2.260474e-03
## age 0.2425584 0.00833154 29.113274 6.504859e-120
## smokeYes 1.9435707 0.41428463 4.691390 3.309624e-06
## age:smokeYes -0.1627027 0.03073753 -5.293290 1.645078e-07
age:smokeYes is the change in slope for the smokers, indicating that for each increase in age by one, the fev value increases by -0.163 + 0.243 = 0.080. age is the slope for the baseline group, aka non-smokers, indicating that for each increase in age by one, fev increases by 0.243.
Do you believe that smoking changes the relationship between age and lung volume? Support your answer using information from the regression table
yes – this model indicates that for people who smoke, lung capacity grows far more slowly with age (slope 0.080) than it does for non-smokers (slope 0.243). the p-value in this table indicates that this difference in slope between non-smokers and smokers is significant (the p-value for age:smokeYes is 1.645e-07).
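the smokers' slope quoted above can be recovered directly from the printed coefficients:

```r
## fev ~ age slope among smokers = baseline (non-smoker) slope + interaction term
## (estimates copied from the regression table above)
age_slope_nonsmoker  <- 0.2425584
age_smoke_adjustment <- -0.1627027

age_slope_smoker <- age_slope_nonsmoker + age_smoke_adjustment
round(age_slope_smoker, 3)   # 0.08
```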
Is there a difference in lung volume between an average-aged person who smokes and an average-aged person who doesn't smoke?
the existing model's smoker/non-smoker comparison is made at age = 0 years, but we can mean-center the age variable to compare lung volumes when \(age - \bar{x}_{age} = 0\), which is when age = average age.
ggplot(data = lung_volume, aes(x = age - mean(age),
y = fev,
color = smoke)) +
geom_point() +
geom_smooth(se = F, method = lm, formula = y ~ x) +
theme_bw()
lung_volume = mutate(lung_volume,
mc_age = age - mean(age))
mc_lung_model = lm(fev ~ mc_age * smoke, data = lung_volume)
summary(mc_lung_model)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6622898 0.02305217 115.489795 0.000000e+00
## mc_age 0.2425584 0.00833154 29.113274 6.504859e-120
## smokeYes 0.3277391 0.12861470 2.548225 1.105618e-02
## mc_age:smokeYes -0.1627027 0.03073753 -5.293290 1.645078e-07
this new model tells us the predicted fev for an average-aged smoker is 0.327 higher than for an average-aged non-smoker. we can tell that difference is significant because the p-value for the smokeYes coefficient is 0.011 < 0.05.
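as a check (a sketch, assuming mc_lung_model from above is still available), the smokeYes coefficient should equal the gap between the two groups' predicted fev at the average age, i.e. at mc_age = 0:

```r
## predicted fev at the average age (mc_age = 0) for each group;
## their difference should reproduce the smokeYes estimate, 0.3277
at_mean_age <- data.frame(mc_age = 0, smoke = c("No", "Yes"))
preds <- predict(mc_lung_model, newdata = at_mean_age)
diff(preds)
```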
library(tidyverse)
library(moderndive)
library(performance)
nyc_airport_weather = read.csv("https://bit.ly/nyc_airport_weather")
nyc_airport_weather = nyc_airport_weather %>%
filter(!is.na(avg_temp))
head(nyc_airport_weather)
## month airport avg_temp avg_humid
## 1 1 EWR 35.56216 62.12451
## 2 1 JFK 35.38555 61.66162
## 3 1 LGA 35.95927 59.15609
## 4 2 EWR 34.26332 63.33558
## 5 2 JFK 34.19246 62.55848
## 6 2 LGA 34.35612 60.68603
Visualize a parallel slopes model that uses temperature and airport to predict the humidity level
ggplot(data = nyc_airport_weather, mapping = aes(y = avg_humid,
x = avg_temp,
color = airport)) +
geom_point() +
geom_parallel_slopes(se = F) +
theme_bw()
Fit the same parallel slopes model you visualized and report the regression table
airport_model = lm(avg_humid ~ avg_temp + airport, data = nyc_airport_weather)
summary(airport_model)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.9468963 2.74595125 20.374322 1.648654e-19
## avg_temp 0.1280757 0.04549216 2.815337 8.394683e-03
## airportJFK 2.2716649 1.74601374 1.301058 2.028265e-01
## airportLGA -3.7458961 1.74800212 -2.142959 4.007585e-02
check_model(airport_model, check = c("qq", "normality", "linearity", "homogeneity"))
from this check, we see that there are some mild assumption violations in the normality of residuals tests, but otherwise, this model is mostly appropriate.
Match each of the labeled line segments in the figure below to the coefficient in the equation that it best represents, and also note the coefficient's fitted value from the regression table.
What is the estimated humidity at Newark Liberty International Airport when the average monthly temperature is zero degrees?
new_data = data.frame(avg_temp = 0, airport = "EWR")
predict(airport_model, new_data)
## 1
## 55.9469
Imagine that a flight that was scheduled to land at Newark Liberty International Airport got diverted to LaGuardia airport, and the temperature was 64 degrees. Would it be more humid or less humid at LaGuardia than at Newark when the flight lands? By how much?
predict(airport_model, data.frame(avg_temp = 64, airport = "EWR"))
## 1
## 64.14374
predict(airport_model, data.frame(avg_temp = 64, airport = "LGA"))
## 1
## 60.39785
based on this model, we can predict that it would be less humid at LGA when the plane lands than it would be at EWR, by about 3.75 (64.14374 − 60.39785), given that the temperature was 64 at both.
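because this is a parallel slopes model, the gap between airports is the same at every temperature, and it equals the airportLGA coefficient in magnitude. checking with the two predictions above:

```r
## difference between the EWR and LGA predictions at 64 degrees;
## matches the airportLGA estimate (-3.7458961) in magnitude
humid_ewr <- 64.14374
humid_lga <- 60.39785
humid_ewr - humid_lga   # 3.74589
```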
library(Stat2Data)
library(ggplot2)
data("BaseballTimes2017")
head(BaseballTimes2017)
## Game League Runs Margin Pitchers Attendance Time
## 1 CHC-ARI NL 11 5 10 39131 203
## 2 KCR-CHW AL 9 3 7 18137 169
## 3 MIN-DET AL 13 5 10 29733 201
## 4 SDP-LAD NL 7 1 6 52898 179
## 5 COL-MIA NL 9 3 10 20096 204
## 6 CIN-MIL NL 21 1 10 34517 235
pitchers_model = lm(Time ~ Pitchers, data = BaseballTimes2017)
summary(pitchers_model)
##
## Call:
## lm(formula = Time ~ Pitchers, data = BaseballTimes2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.21 -11.19 -5.25 5.06 31.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 124.069 23.412 5.299 0.000189 ***
## Pitchers 8.017 2.722 2.946 0.012239 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.52 on 12 degrees of freedom
## Multiple R-squared: 0.4197, Adjusted R-squared: 0.3713
## F-statistic: 8.678 on 1 and 12 DF, p-value: 0.01224
ggplot(data = BaseballTimes2017, mapping = aes(x = Pitchers,
y = Time)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ x) +
theme_bw()
Extend the simple linear regression model above by including the total number of runs scored as an explanatory variable
pitcher_run_model = lm(Time ~ Pitchers + Runs, data = BaseballTimes2017)
summary(pitcher_run_model)
##
## Call:
## lm(formula = Time ~ Pitchers + Runs, data = BaseballTimes2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.789 -10.682 -2.683 5.260 34.211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134.135 21.410 6.265 6.13e-05 ***
## Pitchers 2.787 3.528 0.790 0.4462
## Runs 3.262 1.600 2.039 0.0663 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.59 on 11 degrees of freedom
## Multiple R-squared: 0.5788, Adjusted R-squared: 0.5022
## F-statistic: 7.558 on 2 and 11 DF, p-value: 0.008605
How should you interpret the Runs coefficient in the context of these data?
The Runs coefficient tells us that, based on this sample, the model predicts that for each increase in Runs by one, holding the Pitchers value constant at any given value, the time will increase by 3.262 minutes.
Is there still good evidence for a linear relationship between the number of pitchers and game duration, even after taking into account the number of runs scored in a game?
No. Once Runs is accounted for, the evidence for Pitchers is no longer significant (p = 0.4462). There's a stronger relationship between the number of Runs and the duration of the game than between the number of pitchers and the duration, as evidenced by Runs' p-value being lower than Pitchers'.
Why does the effect of Pitchers appear smaller, once you account for the number of runs scored?
This is a multiple linear regression model, meaning the coefficients of the explanatory variables, which are themselves related, are adjusted to account for their simultaneous effects. Pitchers and Runs are correlated, so once Runs is in the model, part of the variation in Time that Pitchers explained on its own is attributed to Runs instead. This results in the effect of Pitchers looking smaller when Runs is accounted for.
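that explanation predicts that Pitchers and Runs are themselves correlated; a quick check (a sketch, assuming BaseballTimes2017 is loaded as above):

```r
## if Pitchers and Runs overlap, their correlation should be well above zero,
## which is why Pitchers' coefficient shrinks once Runs enters the model
cor(BaseballTimes2017$Pitchers, BaseballTimes2017$Runs)
```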
library(Stat2Data)
library(tidyverse)
data("MothEggs")
moth_model = lm(Eggs ~ BodyMass, data = MothEggs)
summary(moth_model)
##
## Call:
## lm(formula = Eggs ~ BodyMass, data = MothEggs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -157.586 -17.187 3.162 25.790 67.960
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.38 45.38 0.537 0.59423
## BodyMass 79.86 26.69 2.992 0.00492 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.75 on 37 degrees of freedom
## Multiple R-squared: 0.1948, Adjusted R-squared: 0.173
## F-statistic: 8.95 on 1 and 37 DF, p-value: 0.004916
ggplot(data = MothEggs, mapping = aes(x = BodyMass, y = Eggs)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ x) +
scale_x_continuous(breaks = seq(1.2, 2.3, by = 0.1)) +
theme_bw()
slope_coeff = 79.86
std_error = 26.69 ## the slope's standard error, from the regression table
alpha = 0.05
df = nrow(MothEggs) - 2
t_crit = qt(1 - alpha/2, df)
margin_error = t_crit * std_error
lower = slope_coeff - margin_error
upper = slope_coeff + margin_error
for a moth with a body mass of 1.3 grams, the predicted number of eggs is 24.38 + 79.86*1.3 ≈ 128, and the plausible range shown on the plot is about 100 to 150.
if an 84% confidence interval was constructed around the fitted slope coefficient from each sample, i would predict that 0.84*200 = 168 of the intervals would capture the true slope of the Eggs ~ BodyMass regression model.
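that expected count follows from treating each interval as an independent "capture" with probability 0.84; the binomial mean and spread give a sense of how much the count would vary:

```r
## expected number of 84% CIs (out of 200) that capture the true slope,
## plus the binomial standard deviation of that count
n_intervals <- 200
coverage    <- 0.84
n_intervals * coverage                          # 168
sqrt(n_intervals * coverage * (1 - coverage))   # about 5.2
```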
library(tidyverse)
set.seed(1)
n = 20
# simulated_data = data.frame(dataset_id = rep(1:10000, each = n),
# x = rnorm(n, mean = 10, sd = 5)) %>%
# mutate(y = 3 + rnorm(n*10000, mean = 0, sd = 2))
# extract_slope_and_std_error = function(model) {
# reg_table = summary(model)$coefficients
# stats = data.frame(Slope = reg_table[2, "Estimate"],
# Std_Error = reg_table[2, "Std. Error"])
# return(stats)
# }
#
# ## generate slopes from simulated data grouped by id number, then summarize w/ slope and std error
# slopes = simulated_data %>%
# group_by(dataset_id) %>%
# summarise(extract_slope_and_std_error(lm(y ~ x)))
#
# head(slopes)
## t-value is the slope/std-error value
# slopes = slopes %>%
# mutate(t = Slope / Std_Error)
#
# head(slopes)
visualize the t-statistic distribution
# ## data is slopes, put on the x-axis the t-statistics
# ggplot(slopes, aes(x = t)) +
# ## make a histogram with 40 bins that shows density
# geom_histogram(mapping = aes(y = after_stat(density)),
# bins = 40) +
# ## do colors
# geom_function(color = "blue", fun = function(x) { dt(x, df = n - 2)}) +
# ## set the bounds for the x axis
# scale_x_continuous("t statistics for slope", limits = c(-5, 5))
the qt() function looks up quantiles of a t distribution. these distributions are symmetrical, which is why the two calls below return the +/- versions of the same value
## has only 1% of results below it (and 99% above)
qt(0.01, df = 10000-2)
## [1] -2.326721
## has 99% of all results below it (and only 1% above)
qt(0.99, df = 10000-2)
## [1] 2.326721
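that symmetry can be verified directly: for any t distribution, the lower-tail and upper-tail quantiles are mirror images:

```r
## the t distribution is symmetric about zero, so qt(p, df) = -qt(1 - p, df)
q_low  <- qt(0.01, df = 10000 - 2)
q_high <- qt(0.99, df = 10000 - 2)
isTRUE(all.equal(q_low, -q_high))   # TRUE
```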
we violate the normality assumption by using an exponential error distribution instead of a normal one
set.seed(1)
# bad_data = data.frame(dataset_id = rep(1:10000, each = n),
# x = rnorm(n, mean = 10, sd = 5)) %>%
# mutate(y = 3 + rexp(n*10000, rate = 3))
#
# head(bad_data)
# bad_model_slopes = bad_data %>%
# group_by(dataset_id) %>%
# summarize(extract_slope_and_std_error(lm(y ~ x))) %>%
# mutate(t = Slope / Std_Error)
slopes from the exponential-error data
# ggplot(data = bad_model_slopes, mapping = aes(x = t)) +
# geom_histogram(aes(y = after_stat(density)),
# bins = 40, fill = "red", alpha = 0.5) +
# geom_function(color = "blue", fun = function(x) dt(x, df = n - 2)) +
# scale_x_continuous("t statistics for slope", limits = c(-5, 5))
I would not have expected this distribution to be unimodal and mostly symmetrical – rather, because the errors are exponential, I would have expected the majority of values to be further from the center, more of a bowl shape than a hill.
this value has 99% of all values in the distribution above it
# bad_model_slopes %>%
# summarize(q = quantile(t, 0.01))
this value has only 1% of values in the distribution above it.
this value is closer to zero than the 0.01 quantile, implying the distribution isn’t symmetrical
# bad_model_slopes %>%
# summarize(q = quantile(t, 0.99))
the t-distribution should be fully symmetrical but this one is not, meaning you can't make accurate two-tailed test inferences from it.
library(infer)
library(ggplot2)
NCbirths = read.csv("https://bit.ly/ncbirths")
head(NCbirths)
## fage mage mature weeks term visits gained weight sex smoker
## 1 22 20 younger mom 32 premie 5 40 2.69 male 1
## 2 32 32 younger mom 38 full term 10 42 8.88 male 0
## 3 34 33 younger mom 38 full term 18 6 7.06 female 0
## 4 NA 18 younger mom 42 full term 15 27 7.44 male 0
## 5 NA 22 younger mom 40 full term 18 54 6.38 male 1
## 6 43 39 mature mom 38 full term 17 40 7.50 female 0
we are looking at the relationship between the age of the mother and the weight of the baby
plot = ggplot(data = NCbirths, mapping = aes(x = mage, y = weight)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ x, se = FALSE) +
theme_bw()
plot
the actual slope of the observed data
observed_slope = NCbirths %>%
specify(weight ~ mage) %>%
calculate(stat = "slope")
generate the random distribution using permutations
set.seed(54321)
permutation_dist = NCbirths %>%
specify(weight ~ mage) %>% ## variables and relationship?
hypothesize(null = "independence") %>% ## set the null hypothesis there's no relationship
generate(reps = 1000, type = "permute") %>% ## generate the random distributions
calculate(stat = "slope") ## calculate the slope of each distribution
visualize the random distribution
visualize(permutation_dist) +
shade_p_value(obs_stat = observed_slope, direction = "both")
## Warning in (function (mapping = NULL, data = NULL, stat = "identity", position = "identity", : All aesthetics have length 1, but the data has 1000 rows.
## ℹ Did you mean to use `annotate()`?
generate a p-value
p_value = permutation_dist %>%
get_p_value(obs_stat = observed_slope, direction = "both")
p_value
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.072
the slope of the line in figure 1 (first visualization) is the slope fitted to the original data, aka observed_slope:
## Response: weight (numeric)
## Explanatory: mage (numeric)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 0.0134
Using the visualize() function displays a histogram. Ignoring any red shaded regions, what does this histogram itself represent?
this visualization shows the distribution of the slope values for the 1000 random permutations of the dataset. so it shows how many permuted datasets (y-axis) produced each slope value (x-axis).
What are the chances of observing a slope as extreme or more extreme as the one observed in these data, if the mother's age and the baby's birth weight are truly unrelated to one another?
given the mother’s age has no impact on the weight of the baby, the
chances of observing a slope as extreme as the one observed in the data
is the p-value of the original data in the randomization distribution,
aka p_value
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.072
according to this test, the chance of a slope this extreme arising when there is truly no relationship is 0.072, which is above the conventional 0.05 cutoff – so we do not have significant evidence that the mother's age is correlated with the baby's weight.
library(Stat2Data)
library(ggplot2)
library(broom)
library(equatiomatic)
## load data
data("BaseballTimes2017")
head(BaseballTimes2017)
## Game League Runs Margin Pitchers Attendance Time
## 1 CHC-ARI NL 11 5 10 39131 203
## 2 KCR-CHW AL 9 3 7 18137 169
## 3 MIN-DET AL 13 5 10 29733 201
## 4 SDP-LAD NL 7 1 6 52898 179
## 5 COL-MIA NL 9 3 10 20096 204
## 6 CIN-MIL NL 21 1 10 34517 235
plot = ggplot(BaseballTimes2017, mapping = aes(x = Pitchers,
y = Time)) +
geom_point() +
theme_bw() +
geom_smooth(method = lm, se = FALSE, formula = y ~ x)
plot
a: Fit the regression model visualized above
baseball_model = lm(Time ~ Pitchers, BaseballTimes2017)
b: Print the model summary
summary(baseball_model)
##
## Call:
## lm(formula = Time ~ Pitchers, data = BaseballTimes2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.21 -11.19 -5.25 5.06 31.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 124.069 23.412 5.299 0.000189 ***
## Pitchers 8.017 2.722 2.946 0.012239 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.52 on 12 degrees of freedom
## Multiple R-squared: 0.4197, Adjusted R-squared: 0.3713
## F-statistic: 8.678 on 1 and 12 DF, p-value: 0.01224
c: Interpret the model’s slope coefficient in a sentence.
This model indicates that for each additional pitcher used during a game, the game tends to last 8.017 minutes longer.
Use the fitted model’s equation to compute the residual error for
Game #10, between the Los Angeles Angels and the Seattle Mariners, a
game where 9 pitchers were used (feel free to use pen and paper when
working with the equation). Finally, check your work using the
augment() function from the broom package to
inspect the precise residual error.
isolate the coefficients
baseball_coefs = coef(baseball_model)
baseball_coefs
## (Intercept) Pitchers
## 124.068966 8.017241
generate the estimated time
## Game 10 time = intercept + slope * # of pitchers
g10_time_est = 124.068966 + 8.017241*9
g10_time_est
## [1] 196.2241
actual time:
## access baseball times at row 10, column 7 (game 10, time)
g10_time = BaseballTimes2017[10, 7]
g10_time
## [1] 184
calculate the residual by subtracting the estimate from the actual value:
g10_resid = g10_time - g10_time_est
g10_resid
## [1] -12.22413
calculate the residuals for all rows from the computer’s model
residuals = augment(baseball_model)
isolate game 10's actual residual
residuals[10, ".resid"]
## # A tibble: 1 × 1
## .resid
## <dbl>
## 1 -12.2